May 26, 2020

Let’s Get Started

  • Download R and RStudio
  • Open RStudio and explore the Console
  • Helpful Reading: Hadley Wickham’s book

What is R?

  • R is a programming language that is commonly used for data science
  • RStudio is the IDE that allows you to code in R (and other languages like SQL and Python)
    • IDE stands for Integrated Development Environment
    • Allows you to write functions and operations
    • Tidy and Visualize data
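
As a rough illustration of what writing functions and operations looks like in R, the sketch below uses only base R. The function name celsius_to_fahrenheit and the sample values are made up for this example.

```r
# Hypothetical helper: convert temperatures from Celsius to Fahrenheit
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

# Operations in R are vectorized: they apply to every element at once
temps_c <- c(0, 25, 100)
temps_f <- celsius_to_fahrenheit(temps_c)

print(temps_f)        # 32 77 212
print(mean(temps_c))  # average of the three Celsius values
```

You can paste these lines into the Console one at a time to see each result.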

RStudio

  • RStudio is organised into panes, including the Console, Environment, Files and Packages

What is Markdown?

  • R Markdown is the tool for you to report and present your code / output
    • File -> New File -> R Markdown -> Document -> HTML
  • Set output to different file formats (with the Knit button)

Markdown Syntax

  • You will need to learn Markdown syntax
  • Use a cheat sheet as a reference (search online for "Markdown cheat sheet")
  • You can use Markdown to embed formatting instructions into your text. For example, you can make a word italicized by surrounding it in asterisks, bold by surrounding it in two asterisks, and monospaced (like code) by surrounding it in backticks:

*italics*, **bold**, `code`

  • You can turn a word into a link by surrounding it in square brackets and then placing the full URL (including https://) right after it in parentheses, like this:

[Columbia U](https://www.columbia.edu)

R Markdown Cheat Sheet

Headers

To create titles and headers, use leading hashtags. The number of hashtags determines the header’s level:

# First level header
## Second level header
### Third level header

Lists

To make a bulleted list in Markdown, place each item on a new line after an asterisk and a space, like this:

* item 1
* item 2
* item 3

You can make an ordered list by placing each item on a new line after a number followed by a period followed by a space.

1. item 1
2. item 2
3. item 3

Embedding equations

You can also use Markdown syntax to embed LaTeX math equations into your reports. To embed an equation in its own centered equation block, surround the equation with two pairs of dollar signs, like this:

$$1 + 1 = 2$$

To embed an equation inline, surround it with a single pair of dollar signs, like this: $1 + 1 = 2$

All standard LaTeX symbols work.

Including R code inline and in chunks

  • R code can be included as a chunk with

    ```{r} ```

    or inline, by wrapping the code in single backticks and starting it with the letter r.

  • R functions sometimes return messages, warnings, and even error messages. By default, R Markdown will include these messages in your report. You can use the message, warning and error options to prevent R Markdown from displaying these.

  • The keyboard shortcut to create a new chunk is Cmd + Option + I on macOS (Ctrl + Alt + I on Windows and Linux)
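
One way to apply the message and warning options to every chunk at once is a setup chunk near the top of the document. The sketch below assumes the knitr package is installed (it is needed to knit an R Markdown file anyway); which options to switch off is only an example choice.

```r
# Place this inside a setup chunk at the top of the .Rmd file.
# Assumption: the knitr package is installed.
knitr::opts_chunk$set(
  message = FALSE,  # hide messages, e.g. package start-up notes
  warning = FALSE   # hide warnings
)

# Check the value that later chunks will use by default
print(knitr::opts_chunk$get("message"))
```

The same options can also be set for a single chunk by writing them in its header, for example {r, message=FALSE}.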

Popular chunk options

Knitr

knitr is an engine for dynamic report generation with R and is used to convert (or “knit”) R Markdown files into the desired output format.

Other Output Formats

  • html_document
  • pdf_document
  • word_document
  • beamer_presentation / slidy_presentation / ioslides_presentation
  • github_document
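
Besides the Knit button, the output format can be chosen from code. The sketch below assumes the rmarkdown package and pandoc are installed (both come with a standard RStudio setup) and writes a made-up, throwaway .Rmd file so that it is self-contained. Any of the format names listed above could be swapped in, though pdf_document additionally needs a LaTeX installation.

```r
# Write a tiny, made-up R Markdown file to a temporary location
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Demo\"",
  "---",
  "",
  "The answer is `r 1 + 1`."
), rmd_path)

# Only render when rmarkdown and pandoc are actually available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path,
                                output_format = "html_document",
                                quiet = TRUE)
  print(out_file)  # path to the rendered .html file
}
```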

Packages and Dependencies

  • Installing packages
#install.packages("dplyr")
library(dplyr)
  • Or use a package manager, e.g. pacman (p_load() installs a package if needed, then loads it)
#install.packages("pacman")
library(pacman)
p_load(dplyr, ggplot2)
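
If you prefer not to depend on pacman, a similar "install only if missing" pattern can be sketched in base R. The package names below are placeholders chosen because they ship with R, so the example runs without a network connection; in practice you would list packages such as dplyr or ggplot2.

```r
# Hypothetical list of packages needed for an analysis
needed <- c("stats", "utils")  # swap in e.g. "dplyr", "ggplot2"

# requireNamespace() returns TRUE when a package is installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Attach every package; character.only lets library() take a string
for (pkg in needed) {
  library(pkg, character.only = TRUE)
}
```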

Core Packages in R

  • ggplot2 (graphics)
  • tibble (data frames and tables)
  • tidyr (make tidy)
  • readr (read in tabular formats)
  • purrr (functional programming)
  • dplyr (manipulate data)
  • tidyverse (All the above)

Importing / Reading in Data

# Using the data.table package to read files
p_load(data.table)
flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
# Using the readxl package to read in Excel files
library(readxl)
rawData <- read_excel(path = "data/data_example1.xlsx", # Path to file
                    sheet = 2, # We want the second sheet
                    skip = 1, # Skip the first row
                    na = "NA") # Treat the string "NA" as a missing value
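
For plain CSV files, base R's read.csv() is another option that needs no extra packages. The sketch below first writes a small, made-up CSV to a temporary file so that it runs on its own; the column names and values are invented for illustration.

```r
# Create a small, made-up CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,dep_delay",
  "AA,JFK,14",
  "AA,LGA,NA",
  "UA,EWR,-3"
), csv_path)

# read.csv() ships with R; na.strings says which text counts as missing
sample_flights <- read.csv(csv_path,
                           na.strings = "NA",
                           stringsAsFactors = FALSE)

print(sample_flights)
print(sum(is.na(sample_flights$dep_delay)))  # number of missing delays
```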

Take a Look at the Data

head(flights) # head() / tail() show the first / last 6 rows by default
##    year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475
## 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475
## 3: 2014     1   1         2         9      AA    JFK  LAX      351     2475
## 4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035
## 5: 2014     1   1         2         1      AA    JFK  LAX      350     2475
## 6: 2014     1   1         4         0      AA    EWR  LAX      339     2454
##    hour
## 1:    9
## 2:   11
## 3:   19
## 4:    7
## 5:   13
## 6:   18

Another Way to Look at Data

dim(flights) # Get the dimensions of the data (rows, columns)
## [1] 253316     11
colnames(flights) # Get the column names
##  [1] "year"      "month"     "day"       "dep_delay" "arr_delay" "carrier"  
##  [7] "origin"    "dest"      "air_time"  "distance"  "hour"
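
A few more base R functions are handy for a first look at a data set. The sketch below uses mtcars, a small data set that ships with R, as a stand-in so it runs without downloading anything; the same calls should work on flights.

```r
# mtcars is a small data set that ships with R
str(mtcars)               # column types and a preview of values
summary(mtcars$mpg)       # min, quartiles, mean and max of one column
nrow(mtcars)              # number of rows
ncol(mtcars)              # number of columns
tail(mtcars, n = 3)       # last 3 rows; n works for head() too
```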

Seeking help

  • Look at the Help tab in RStudio, or ask for help from the Console
?ggplot2
help(dplyr)
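
Help also works for individual functions, and you can search when you do not know the exact name. The sketch below sticks to functions that ship with R; the search term and the pattern are arbitrary examples.

```r
# Open the help page for a specific function
?mean
help("read.csv")

# Search all installed help pages when you don't know the exact name
help.search("linear model")

# List objects on the search path whose names match a pattern
matches <- apropos("^read\\.")
print(matches)

# Run the examples from a function's help page
example("mean")
```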

Data Science Flow Chart

  • We have just explored the 1st part (Importing data)
  • This course will focus on Tidying, Transforming and some Visualisation
  • Not a full-fledged Statistics or Machine Learning Course

Why learn Data Science & R?

  • In-demand skill
  • Data formats are changing, and traditional tools like Microsoft Excel are insufficient for certain tasks
  • Easy to carry out tasks like web scraping, machine learning, statistical analysis and web / dashboard building
  • R is an easy first programming language to learn
  • R is free and open source, with 14,837 packages available
  • RStudio is arguably a better IDE than Jupyter (IMO it is the Apple to the Microsoft)

Possible Projects

What will we cover next?

  • Class 2:
    • Simple data manipulation with dplyr
    • Data visualization with ggplot2
  • Class 3:
    • Function Writing with purrr
    • Some Statistical Analysis with psych and base R
  • Class 4:
    • Statistical Analysis (continued)
    • Basic Machine Learning Concepts with caret

What will we cover next?

  • Class 5:
    • Working with strings using stringr and rebus packages
    • Simple NLP using tidytext, tm and wordcloud
  • Class 6:
    • Web scraping with rvest
    • API with httr
  • Class 7:
    • Dashboard & Website Building with shiny